Full refactorization of modelling and visualisation functions #200

ntorresd · 2024-08-01T21:39:45Z

This PR replaces the old modelling and visualisation functions with the refactored versions agreed on by the package development team (@zmcucunuba, @ben18785, @sumalibajaj, @jpavlich, @ekamau) earlier this year.

We designed these new functions to make the selection and specification of the Bayesian models more flexible than before, and to include additional constant, time-varying and age-varying models with the option to estimate the seroreversion rate $\mu$. In particular, the usage of fit_seromodel() went from (e.g.):

seromodel <- fit_seromodel(
  serodata = serodata,
  foi_model = "tv_normal",
  foi_location = 0,
  foi_scale = 1,
  ...
)

to

seromodel <- fit_seromodel(
  serosurvey = serosurvey,
  model = "time",
  foi_prior = sf_normal(0,1),
  ...
)

Note that we now refer to the serological survey as serosurvey rather than serodata, and we restrict it to the following columns:

tsur: Year in which the survey took place
age_min: Minimum age of the age group
age_max: Maximum age of the age group
sample_size: Number of samples for each age group
n_seropositive: Number of positive samples for each age group

The function sf_normal is one of a set of auxiliary functions designed to specify the prior distributions of the parameters to be estimated (FOI or seroreversion rate), as suggested in #193 . Currently available priors are sf_normal() and sf_uniform(), corresponding to Gaussian and uniform priors respectively.
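To make the idea of functional prior specification concrete, here is a minimal base-R sketch. The names sketch_normal(), sketch_uniform() and describe_prior(), and the argument names mean/sd/min/max, are hypothetical and only for illustration; they are not the implementations of sf_normal() or sf_uniform() in this PR.

```r
# Hypothetical prior constructors, written only to illustrate the idea;
# they are NOT serofoi's sf_normal()/sf_uniform() implementations.
sketch_normal <- function(mean = 0, sd = 1) {
  stopifnot(is.numeric(mean), is.numeric(sd), sd > 0)
  list(name = "normal", mean = mean, sd = sd)
}

sketch_uniform <- function(min = 0, max = 10) {
  stopifnot(is.numeric(min), is.numeric(max), min < max)
  list(name = "uniform", min = min, max = max)
}

# A fitting function can then branch on the prior's name instead of
# taking one argument per hyperparameter.
describe_prior <- function(prior) {
  switch(
    prior$name,
    normal = sprintf("normal(mean = %g, sd = %g)", prior$mean, prior$sd),
    uniform = sprintf("uniform(min = %g, max = %g)", prior$min, prior$max),
    stop("unknown prior: ", prior$name)
  )
}

describe_prior(sketch_normal(0, 1))
describe_prior(sketch_uniform(0, 10))
```

Each prior travels as a single object, so a new distribution only needs a new constructor and a new branch, rather than new arguments on the fitting function.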

To illustrate the current pipeline, consider a disease with constant FOI $\lambda = 0.02$ and seroreversion rate $\mu=0.01$:

foi <- data.frame(
  year = 1951:2000,
  foi = rep(0.01, 50)
)
seroreversion_rate <- 0.01

and a serological survey with the following features:

survey_features <- data.frame(
  age_min = 1:50,
  age_max = 1:50,
  sample_size = ceiling(runif(50, min = 0, max = 100))  # runif(n, min, max): one whole-number size per age group
)

$ age_min     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,…
$ age_max     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,…
$ sample_size <dbl> 51, 35, 14, 86, 70, 34, 50, 98, 51, 93, 78, 25, 36, 40, 15, 43, 79, 96, 96, 64, 56, 31, 84, 45…

We can use simulate_serosurvey(), introduced in #199, to simulate statistically consistent seropositive counts n_seropositive for each age group, according to our set of serological models. In this case, I use the time-varying model:

library(dplyr)  # provides %>% and mutate()

serosurvey <- simulate_serosurvey(
  model = "time",
  foi = foi,
  survey_features = survey_features,
  seroreversion_rate = seroreversion_rate
) %>% mutate(
  tsur = max(foi$year)
)

$ age_min        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, …
$ age_max        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, …
$ sample_size    <dbl> 51, 35, 14, 86, 70, 34, 50, 98, 51, 93, 78, 25, 36, 40, 15, 43, 79, 96, 96, 64, 56, 31, 84,…
$ n_seropositive <int> 4, 1, 2, 12, 14, 2, 9, 19, 12, 27, 18, 6, 14, 12, 5, 19, 28, 31, 45, 26, 25, 13, 37, 30, 42…
$ tsur           <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2…

which we can visualise using plot_serosurvey:

plot_serosurvey(serosurvey = serosurvey)
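For intuition about what "statistically consistent" counts means, here is a base-R sketch under the assumption of the standard catalytic model with constant FOI $\lambda$ and seroreversion rate $\mu$, where the probability of being seropositive at age $a$ is $\frac{\lambda}{\lambda + \mu}\left(1 - e^{-(\lambda + \mu) a}\right)$. The helper prob_seropositive_constant() is hypothetical, the parameter values and sample sizes are illustrative, and simulate_serosurvey() may differ in its details.

```r
# Hypothetical helper (not part of serofoi): probability of being seropositive
# at a given age under a constant FOI with seroreversion.
prob_seropositive_constant <- function(age, foi, seroreversion_rate = 0) {
  total_rate <- foi + seroreversion_rate
  (foi / total_rate) * (1 - exp(-total_rate * age))
}

set.seed(123)  # arbitrary seed, for reproducibility only
ages <- 1:50
n_sample <- rep(100, length(ages))  # illustrative sample sizes

prob <- prob_seropositive_constant(ages, foi = 0.02, seroreversion_rate = 0.01)

# One binomial draw per age group gives counts consistent with the model
n_seropositive <- rbinom(length(ages), size = n_sample, prob = prob)

head(data.frame(age = ages, n_sample, prob = round(prob, 3), n_seropositive))
```

With seroreversion the seroprevalence plateaus at $\lambda / (\lambda + \mu)$ instead of approaching 1, which is what allows a model to tell the two rates apart.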

Implementing the constant model with and without seroreversion by means of fit_seromodel():

# initialization function for sampling
init <- function() {
  list(foi_vector = rep(0.1, nrow(survey_features)))
}

# constant model without seroreversion
seromodel_constant_no_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "constant",
  foi_prior = sf_uniform(0, 10),
  init = init
)

# constant model with seroreversion
seromodel_constant_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "constant",
  foi_prior = sf_uniform(0, 10),
  is_seroreversion = TRUE,
  seroreversion_prior = sf_uniform(0, 1),
  init = init
)

Visualizing the results:

plot_no_serorev <- plot_seromodel(
  seromodel = seromodel_constant_no_serorev,
  serosurvey = serosurvey
)

plot_serorev <- plot_seromodel(
  seromodel = seromodel_constant_serorev,
  serosurvey = serosurvey
)

cowplot::plot_grid(plot_no_serorev, plot_serorev)

Note that the model estimating the seroreversion rate (right panel in the image) is not only a better fit for this serological survey according to the elpd value, but it also accurately estimates both the FOI and the seroreversion rate we used to simulate the serosurvey.

Another difference from previous versions lies in how we visualise the estimated parameters. Since the constant model estimates a single FOI value, we only show the estimated value with its corresponding credible interval (we used to plot it instead). However, for the time-varying and age-varying models we estimate several FOI values across the time/age span of the survey, which we visualise graphically. Take for instance the time-varying model with seroreversion:

# implementing the time-varying model with seroreversion
seromodel_time_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "time",
  foi_prior = sf_uniform(0, 10),
  is_seroreversion = TRUE,
  seroreversion_prior = sf_uniform(0, 1),
  init = init,
  iter = 4000
)
# plotting
plot_time_serorev <- plot_seromodel(
  seromodel = seromodel_time_serorev,
  serosurvey = serosurvey
)

Here we plot the estimated values of the FOI as a function of time (blue line and shaded area) and the R-hat estimates for each estimated value (black dots). Note that, even though we ran the model for a large number of iterations (4000), the model did not converge, since there are R-hat values over 1.01. We can improve convergence by estimating fewer FOI values. To specify the time intervals for which FOI values will be estimated, we index them by means of get_foi_index():

foi_index <- get_foi_index(
  serosurvey = serosurvey,
  group_size = 5
)

> foi_index
 [1]  1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4  5  5  5  5  5  6  6  6  6  6  7  7  7  7  7  8  8
[38]  8  8  8  9  9  9  9  9 10 10 10 10 10

This function returns a vector of indexes labelling each year (or age, for age-varying models). A single FOI value is estimated for each index, meaning that 10 FOI values will be estimated in this particular case:

init <- function() {
  list(foi_vector = rep(0.1, max(foi_index)))
}

seromodel_time_serorev <- fit_seromodel(
  serosurvey = serosurvey,
  model = "time",
  foi_prior = sf_uniform(0, 10),
  is_seroreversion = TRUE,
  seroreversion_prior = sf_uniform(0, 1),
  foi_index = foi_index,
  init = init,
  iter = 4000
)

plot_seromodel(
  seromodel_time_serorev,
  serosurvey = serosurvey
)

Now the model converges. Less regular groupings can also be specified through foi_index, as long as its length covers the whole time/age span of the serological survey.
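As a rough illustration of regular versus irregular groupings, here is a base-R sketch. make_regular_index() is a hypothetical helper that reproduces the vector printed earlier; it is not the implementation of get_foi_index(), and the irregular grouping is an arbitrary example.

```r
# Hypothetical helper reproducing the regular grouping printed above;
# it is not serofoi's get_foi_index().
make_regular_index <- function(n_values, group_size) {
  n_groups <- ceiling(n_values / group_size)
  rep(seq_len(n_groups), each = group_size)[seq_len(n_values)]
}

regular_index <- make_regular_index(n_values = 50, group_size = 5)

# Irregular grouping: one FOI value for the first 30 positions, then two
# finer groups of 10 positions each.
irregular_index <- rep(1:3, times = c(30, 10, 10))

# Either index must cover the whole span (50 positions here), and the
# initial values need one entry per distinct index.
stopifnot(length(regular_index) == 50, length(irregular_index) == 50)
init_values <- rep(0.1, max(irregular_index))
```

The irregular version estimates only 3 FOI values, so the initial values passed through init need length 3 rather than 10.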

github-actions · 2024-08-21T16:51:16Z

This pull request:

Adds 1 new dependencies (direct and indirect)
Adds 1 new system dependencies
Removes 0 existing dependencies (direct and indirect)
Removes 0 existing system dependencies

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)

github-actions · 2024-08-21T18:20:52Z

This pull request:

Adds 1 new dependencies (direct and indirect)
Adds 1 new system dependencies
Removes 0 existing dependencies (direct and indirect)
Removes 0 existing system dependencies

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)

github-actions · 2024-08-22T16:56:40Z

This pull request:

Adds 1 new dependencies (direct and indirect)
Adds 1 new system dependencies
Removes 0 existing dependencies (direct and indirect)
Removes 0 existing system dependencies

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)

ben18785 · 2024-08-30T16:03:09Z

DESCRIPTION

@@ -1,7 +1,7 @@
 Package: serofoi
 Type: Package
 Title: Estimates the Force-of-Infection of a given pathogen from population based seroprevalence studies


population-based

ben18785 · 2024-08-30T16:08:22Z

R/build_stan_data.R

+build_stan_data <- function(
+    serosurvey,
+    model_type = "constant",
+    foi_prior = sf_uniform(),


I checked the config.yml file and it did seem to have a max default of 10 (which I think is fine); I just wanted to double-check that I read it correctly?

Yes, that's right. The default uniform distribution is sf_uniform(0, 10) and for the normal distribution it is sf_normal(0, 1).

ben18785 · 2024-08-30T16:20:50Z

Hi @ntorresd,

This is so great. Thank you so so much for this work -- it lays the foundation for a whole lot more work for us all and does so in a really elegant way. (Well done team! Here's to Bogota 2025...)

A few thoughts. These are generally minor and could be discussed and handled in separate PRs if this isn't the right forum:

- naming:
  - tsur -> date_survey? I can imagine how we might, in the future, use surveys taken at different parts of the year (when looking for seasonal effects). This generality might prove useful. At the least, I would suggest we rename this variable to use full proper names.
  - sample_size -> n_sample. We consistently use n_x to mean a count of x throughout the package, and I'd suggest we do so here.
- I love the seroreversion model example you've given in this PR. Could it form the basis of a package article about fitting models with seroreversion?
- With the plotting, is there a way to suppress the written text at the top? I can imagine people wanting to put these into a publication and they may not want the written text.
- I love the example you have about reducing the numbers of parameters being estimated and again I think these should go into one of the package articles.
- foi_index as used in the fit_seromodel function seems a little unsafe to me. I also wasn't sure what this means from the function documentation, "Integer vector specifying the age-groups for which force-of-infection values will be estimated. It can be specified by means of [get_foi_index]". To illustrate my confusion, I wasn't sure in your example in the above whether the first element of foi_index corresponded to the oldest-aged individuals or the youngest. How about adding structure to this input? e.g. a data frame with (year, foi_index) for time models and (age, foi_index) for age models?
- Minor thing but do we want to @export this to users summarise_loo_estimate? Seems easily doable using just the loo::loo functionality.
- Silly question but I couldn't find this on a quick look: do we unit-test the custom functions we use in Stan?
- Relatedly, what would be our unit testing coverage for this PR if merged (i.e. covr::package_coverage())

ntorresd · 2024-09-09T17:25:26Z

Hi @ben18785, thanks a lot for your feedback!

I'll address some of your points in this PR and open issues for the others, as some are closely related to the user feedback collected during the previous user test:

naming:

tsur -> date_survey? I can imagine how we might, in the future, use surveys taken at different parts of the year (when looking for seasonal effects). This generality might prove useful. At the least, I would suggest we rename this variable to use full proper names.

Right now, this variable is named survey_year. For the time being I'll leave it as it is, as we don't plan to implement seasonal models in the near future. We can change this easily once we have reached that point (if we decide to implement it in serofoi and not elsewhere anyway).

sample_size -> n_sample. We consistently use n_x to mean a count of x throughout the package, and I'd suggest we do so here.

I'll implement this one before merging for the sake of naming consistency.

articles

I love the seroreversion model example you've given in this PR. Could it form the basis of a package article about fitting models with seroreversion?

I've opened #205 to address this.

I love the example you have about reducing the numbers of parameters being estimated and again I think these should go into one of the package articles.

I think we should address this one in #204 , opened by one of our users @JDConejeros during the user test; it's a good chance to give a compelling explanation of this concept.

plotting:

With the plotting, is there a way to suppress the written text at the top? I can imagine people wanting to put these into a publication and they may not want the written text.

I added a comment in #202 about this, so we can discuss this further over there.

foi_index

foi_index as used in the fit_seromodel function seems a little unsafe to me. I also wasn't sure what this means from the function documentation, "Integer vector specifying the age-groups for which force-of-infection values will be estimated. It can be specified by means of [get_foi_index]". To illustrate my confusion, I wasn't sure in your example in the above whether the first element of foi_index corresponded to the oldest-aged individuals or the youngest. How about adding structure to this input? e.g. a data frame with (year, foi_index) for time models and (age, foi_index) for age models?

Yes, this is a must. I opened #206 to address this.
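A minimal sketch of what such a structured input could look like for a time model. The column names follow the suggestion quoted above, the years are taken from the example in the PR description, and validate_foi_index() is a hypothetical helper; none of this is a confirmed serofoi interface.

```r
# Hypothetical structure for a time-varying model; column names follow the
# suggestion quoted above and are not a confirmed serofoi interface.
foi_index_time <- data.frame(
  year = 1951:2000,
  foi_index = rep(1:10, each = 5)
)

# Hypothetical validation: every year in the span appears exactly once
validate_foi_index <- function(foi_index_df, years) {
  stopifnot(
    all(c("year", "foi_index") %in% names(foi_index_df)),
    setequal(foi_index_df$year, years),
    !anyDuplicated(foi_index_df$year)
  )
  invisible(foi_index_df)
}

validate_foi_index(foi_index_time, years = 1951:2000)
```

Pairing each index with an explicit year (or age) removes the ambiguity about which end of the vector the first element refers to.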

Export summarise_loo_estimate

Minor thing but do we want to @export this to users summarise_loo_estimate? Seems easily doable using just the loo::loo functionality.

I'd rather keep it exported for the time being. During the user test it was useful to have it available.

Unit testing:

[...] * Relatedly, what would be our unit testing coverage for this PR if merged (i.e. covr::package_coverage())

I haven't run it yet, but it should be very low, as the only part of the code covered by unit testing right now is the data simulation module. We've discussed unit testing with @jpavlich and decided to start over, as the structure of the package has changed so much. After merging this PR to dev, I'll open some issues about this. The idea is to have >90% coverage before submitting to CRAN.

Silly question but I couldn't find this on a quick look: do we unit-test the custom functions we use in Stan?

Not a silly question at all. As mentioned before, we are not testing for any of this right now. We should implement this before merging to main, so this is something we will discuss soon enough.

github-actions · 2024-09-09T22:32:32Z

This pull request:

Adds 1 new dependencies (direct and indirect)
Adds 1 new system dependencies
Removes 0 existing dependencies (direct and indirect)
Removes 0 existing system dependencies

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)

ntorresd · 2024-09-10T21:22:52Z

This PR closes:

ntorresd added 30 commits July 31, 2024 18:03

clean R and tests folders

d97c921

feat: add constant models with and without seroreversion

4dfbf46

delete old modules

1fe63f6

add config file with default priors and distribution indexes

998a182

add input validation utilities

c86101a

feat: add working version of fit_seromodel using functional specifi…

ba67365

…cation of priors

feat: add clean seroprevalence visualisation functions

9ca8276

feat: add age_seroreversion model

58a838d

feat: add foi_index (old chunks) default behavior

760096d

feat: add plot_foi_estimates with option to add foi trend

15aad8d

change model name for models without seroreversion to *_no_seroreversion

a371ac3

feat: add age-varying model without seroreversion

6389d3c

refac: change stan functions names to avoid ambiguity

3a61712

feat: add time varying model with and without seroreversion

fdcfe9b

remove unnecessary line in probability_exact_time_varying

36cf0b2

doc: add documentation for new functions

ea07ee8

refac: move data simulation validation functions to validation module

ea6093f

fix: change to extract_central_estimates to deal with 1-time estimates

962df9e

feat: add 'plot_rhats' function

aedfc4f

remove unnecessary stan file

daa6873

fix: add error message for constant model exceptions in plotting func…

fd75b70

…tions

feat: add summarise_model and plot_summary functions

8ecde6b

feat: add new plot_seromodel function

a93075b

fix: change reference to config file

260e74e

fix: correct stan models reference

afe8512

doc: add export tag to plot_seromodel

d009c50

fix: remove uneccessary argument call in plot_seromodel

de1cca9

feat: introduce 'summarise_loo_estimate' to simplify 'summarise_serom…

80b5ac1

…odel' This function allows to extact an specific loo estimate for model parameters like 'seroreversion_rate' or 'foi'. This introduce changes in: - 'summarise_seromodel' - 'plot_summary' - 'plot_seromodel'

refac: move 'summarise_seromodel' to a separate file

5473e03

feat: add convergence field to summary in 'summarise_seromodel'

9684420

ntorresd added 3 commits August 21, 2024 11:40

fix: correct serofoi-package.Rd encoding

4dfbd35

doc: move vignettes to articles

4f236be

change version tag to 1.0.0

04b0dc3

fix: remove survey_year from plot_serosurvey()

5a4a554

ntorresd added 3 commits August 22, 2024 11:49

doc: add some examples in functions' documentation

8d3ed76

Add examples for: - `fit_seromodel` - `get_foi_index` - `plot_serosurvey`

doc: add complementary information to functions' documentation

c7d8ac8

doc: update documentation

b56a8a9

ben18785 reviewed Aug 30, 2024

ntorresd added 3 commits September 9, 2024 17:28

refac: change sample_size for n_sample across the package

a671e4d

doc: update documentation

6195b5d

change version to 1.0.1

e3092d3

ntorresd merged commit 6f17075 into dev Sep 9, 2024
2 of 8 checks passed

ntorresd deleted the dev-full-refac branch September 9, 2024 22:46

This was referenced Sep 10, 2024

refactor prepare_serodata #191

Closed

Priors functions #193

Closed

Pass argument from fit_seromodel() to rstan::sampling() via ... #118

Closed

ntorresd mentioned this pull request Sep 10, 2024

Modelling workflow #61

Closed

ntorresd mentioned this pull request Sep 26, 2024

Add structure to foi_index across the package #221

Merged
